**Code Documentation for Why is End-of-Life Spending So High? Evidence from Cancer Patients**
**Dan Zeltzer, Liran Einav, Tzvi Shir, Salomon Stemmers, Ran D. Balicer**

This README.txt file provides information on the data and code used to produce all exhibits.


# Data Overview
The data used for this research are from Clalit Health Services, Clalit Research Institute, 40 Tuval Street, Ramat-Gan, Israel 6578898. 
These data contain individual administrative data (health insurance and electronic health records), 
so they are both proprietary and contain confidential patient information. 
To protect the privacy of patients, we do not post the data online. 
However, interested researchers can apply for data access for replication-related purposes, 
under the appropriate arrangements for protecting patient confidentiality and research ethics. 
Email inquiries can be directed to the corresponding author, Dan Zeltzer, at dzeltzer@tauex.tau.ac.il.


# Software Requirements
The code is written in R and was run on an RStudio server with the following specs:
R version 3.6.2
Platform: x86_64-pc-linux-gnu (65-bit)
Running under: Debian GNU/Linux 10 (buster)
R packages: tidyverse, tidymodels, workflows, magrittr, recipes, data.table, ggplot2, 
knitr, kableExtra, plotROC, readr, zeallot, DBI, odbc, Cairo, igraph, stargazer, tibble, RColorBrewer, digest, viridis

Aprroximate runtime: 8 hours.


# Code Overview
The analysis consists of three main steps. 
1. Training of prognosis algorithms
2. Analysis of the relationship of algorithm-produced prognoses and different spending measures. 
3. Production of paper exhibits.

Initial data are ingested in the form of SQL tables. 
The section titled "Dataset List" below describes the SQL tables from which we read in the data. 
All code files are provided. Below is a description of code dependencies. 


## Algorithm Training
The following scripts perform the mortality predictions for different samples and save them as .RData objects that are read in by the analysis scripts

1. model/cancer_prediction_initial.R
  Purpose: creates prediction for cancer sample (index date)
  Inputs: 
	FeaturesExtraction
	tempTable_sampleIds
	FeaturesExtraction_3rdTable
	FeaturesExtraction_4thTable
	FeaturesExtraction2
	FeaturesExtraction_events
	FeaturesExtraction_events_2
  Outputs: 
	revision_p0.RData
	p0_canc_AUC.csv

2. model/cancer_prediction_monthly.R 
  Purpose: Create prediction for monthly cancer sample 
  Inputs: 
	FeaturesExtraction
	tempTable_sampleIds
	FeaturesExtraction_3rdTable
	FeaturesExtraction_4thTable
	FeaturesExtraction2
	FeaturesExtraction_events
	FeaturesExtraction_events_2
  Outputs: 
	revision_monthly.RData
	monthly_canc_AUC.csv

3. model/cancer_prediction_events.R 
  Purpose: Create prediction for events sample 
  Inputs: 
	FeaturesExtraction
	tempTable_sampleIds
	FeaturesExtraction_3rdTable
	FeaturesExtraction_4thTable
	FeaturesExtraction2
	FeaturesExtraction_events
	FeaturesExtraction_events_2
  Outputs: 
	revision_events_intensity.RData
	canc_revision_events_AUC.csv


4. model/cancer_prediction_clalit_admissions.R
  Purpose: Create prediction of clalit admissions-  with and without EMR data (include data from first 48 hours of admission)
  Inputs: 
	FeaturesExtraction
	tempTable_sampleIds
	FeaturesExtraction_3rdTable
	FeaturesExtraction_4thTable
	FeaturesExtraction2
	FeaturesExtraction_events
	FeaturesExtraction_events_2
	AdmissionFeatures
	revision_p0.RData
  Outputs: 
	revision_clalit_all_model.RData
	revision_clalit_all_model_noadm.RData
	revision_clalit_all_model_AUC.csv
	revision_clalit_noEMR_model_AUC.csv


5. Supporting scripts for aformentioned model scripts. 
Their names specify the purpose of the included functions. 
They are sourced at the beginning of the model scripts:
analysis.R, data_retrieval.R, modeling.R, preprocess.R 
(conncection.R, which includes the functions for reading in data from the SQL database, is left out of the submission for reasons of data privacy)


## Analysis and Exhibit Production. 
Script 6 sources all of the scripts in the Model/functions folder and in the analysis folder 
which contain functions for preprocessing, analysis, and generating all of the figures and tables. 
It then calls all of these functions to create most of the figures and tables :

6. DataProc_analysis.R
  Purpose: Generates almost all figures and tables by sourcing functions in scripts in Model/functions folder and calling the functions
  Inputs:
	canc_revision_events_AUC.csv
	monthly_canc_AUC.csv
	revision_clalit_all_model_AUC.csv
	revision_clalit_noEMR_model_AUC.csv
	p0_canc_AUC.csv
	revision_p0.RData
  Outputs: 
	AUC_all_mods.csv
	event_tree.csv
	Spending_byPredictedMortality_cancer_presentation - Figure 3
	MonthlySpending_byPhat_andDeath_cancer_presentation - Figure 3
	Average_Monthly_Spending_by_Type_Low_High - Figure 4
	Cacner_Decedent_Spending_By_Time_aLow_High - Figure 7
	Cacner_Decedent_Spending_By_time_bLow_High - Figure 7
	Death_within_60_days_after_adm_Intensity_events_Pred - Figure 8
	MonthlySpending_byPhat_andDeath_ageQ - Figure 9
	desc_stats_edited - Table 1
	avg_monthly_cost_revision_table_2_phat_month - Table 2
	adm_stat_agg_cancermonth - Table 3
	proc_by_time_before_death - Table 4
	Avg_Spending_Canc_by_age_Quintileby_month - Table 5
	conc_of_cost_version_B - Figure A4
	conc_100_bins_death_adj - Figure A5
	calib_age_group - Figure A8
	dyn_bplot_by_month - Figure A9
	top_10_wards_intensity_desc - Table A1
	Admiss_Char_by_Hospital_Ownership - Table A2
	table_mainTopo_descStats_edited_new - Table A3
	sum_stats_drg_diem - Table A4
	Ave_Monthly_Spending_Cancer_Alternative_Admission_Grouping - Table A5
	Procedures_planned_intensity - Table A6
	desc_stats_train_test - Table A7
	ave_monthly_cost_dynamic_risk_bins_month - Table A8
	top_10_wards_intensity_desc_DS - Table A9
	sum_stats_proc - Table A11


7. figure_1.R
  purpose: Generates figure 1
  Inputs:
  Outputs: 
	s_on_theta.pdf - Figure 1
	s_on_f_and_theta.pdf - Figure 1

8. event_exhibits.R 
  Generates figures and tables associated with event level analysis
  Inputs:
    event_tree.csv
  Outputs:
	ev_fig_number_of_patient_by_event_type_Within_history_depth_stacked - Figure 2
	ev_fig_number_of_patient_by_event_type_within_history_depth_percent - Figure 2
	ev_fig_delta_monthly_avg_1yr_fwd_cost_excl_index_over_delta_phat_binned_errorbars - Figure 5
	ev_fig_delta_monthly_avg_1yr_fwd_cost_excl_index_over_delta_phat_by_current_phat_binned - Figure 5
	ev_fig_phat_parent_phat_child_by_event_type_binned - Figure 6
	ev_AGE_fig_index_spending_on_current_phat_by_age_and_event_type - Figure 10
	ev_AGE_fig_fwd_spending_on_current_phat_by_age_and_event_type - Figure 10
	ev_fig_number_of_patient_by_event_type_within_history_depth_and_main_cancer_type_percent - Figure A1
	ncases_by_phat_and_month_by_group - Figure A2
	spending_on_phat_and_month_by_group - Figure A3
	ev_fig_delta_monthly_avg_1yr_fwd_cost_excl_index_over_delta_phat_by_current_phat_quinitile_binned - Figure A6
	ev_fig_number_of_patient_by_event_type_within_history_depth_percent_by_age - Figure A7
	ev_AGE_tab_summary - Table A10


9. Supporting scripts for generated figures and tables with functions (sourced from other scripts).
   These scripts do not need to be run directly. They are sourced from the relevant exec scripts described above :
All scripts in Analysis folder

The list below provides the name of the scripts which contain the function that produces each exhibit.
fig_MonthlySpending_byPhat_andDeath.R - Figure 3 (FE_Spending_byPredictedMortality_cancer_presentation, MonthlySpending_byPhat_andDeath_cancer_presentation)
Average_Monthly_Spending_by_Type.R - Figure 4 (Average_Monthly_Spending_by_Type_Low_High)
fig_Cancer_Decedent_spending_by_Time.R - Figure 7 (fig_Cacner_Decedent_Spending_By_Time_aLow_High, fig_Cacner_Decedent_Spending_By_time_bLow_High)
do_figure60_days_events.R - Figure 8 (Death_within_60_days_after_adm_Intensity_events_Pred)
fig_MonthlySpending_byPhat_andDeath_ageQ.R - Figure 9 (MonthlySpending_byPhat_andDeath_ageQ)
table_1.R - Table 1 (FE_desc_stats_edited)
table_2.R - Table 2 (ave_monthly_cost_revision_table_2_phat_month)
table_3.R - Table 3 (FE_adm_stat_agg_cancermonth)
table_4.R - Table 4 (tab_proc_by_time_before_death)
table_5_month.R - Table 5 (Avg_Spending_Canc_by_age_Quintileby_month)
fig_concentration.R - Figure A4 (conc_of_cost_version_B)
fig_concetration_100_bins.R - Figure A5 (conc_100_bins_death_adj)
Lines 482-511 of DataProc_analysis.R - Figure A8 (calib_age_group)
fig_dyn_boxplot_by_month.R - Figure A9 (dyn_bplot_by_month)
tab_top_10_wards_intensity_desc.R - Table A1 (top_10_wards_intensity_desc)
tab_Adm_Char_by_Hospital_Ownership.R - Table A2 (Admiss_Char_by_Hospital_Ownership)
tab_mainTopo_descStats.R - Table A3 (table_mainTopo_descStats_edited_new)
table_2_procedure.R - Table A4 (sum_stats_drg_diem)
tab_Alternative_Adm_Grouping.R - Table A5 (Ave_Monthly_Spending_Cancer_Alternative_Admission_Grouping)
tab_Procedures_planned_intensity.R - Table A6 (tProcedures_planned_intensity)
tab_Select_Predictors.R - Table A7 (desc_stats_train_test)
tab_Ave_monthly_spending_cancer.R - Table A8 (ave_monthly_cost_dynamic_risk_bins_month)
tab_top_10_wards_intensity_desc.R - Table A9 (top_10_wards_intensity_desc_DS)
table_2_procedure.R - Table A11 (sum_stats_proc)


# Dataset List

## SQL Datasets 1 and 2 are of our main samples: 
1. sample at level of id-dx (Consists of several different samples: Admissions as index date, Cancer diagnosis as index date, 
		 12 copies of the diagnosis rows with index date 1-12 months from first diagnosis,
		 full Clalit population regardless of cancer diagnosis) 
SQL table name: tempTable_sampleIDs
Columns:
id_var - ID number
index_date - Index Date (For admissions data it's the admission date. For the diagnosis data, it's the date of diagnosis. 
			 For the copies of the diagnosis data, it's 1-12 months after the diagnosis date.
			 For the full population, it's 2013-01-01.)
zihui_ishpuz_bikur - Identifier of hospital visit (Only exists for the admissions data)
sample_source - Dataset from which sourced (adm_data,all_data,cnr_data,cnr_data1,...,cnr_data12)
index_date_365daysAfter - 365 Days after index date
index_date_365daysBefore - 365 Days before index date
index_date_730daysBefore - 730 Days before index date
index_date_1825daysBefore - 1825 days before index date
continuous_membership_365Before_730After - Indicator for continuous membership from 365 days before to 730 days after index date
continuous_membership_730Before_730After - Indicator for continuous membership from 730 days before to 730 days after index date
continuous_membership_1825Before_730After - Indicator for continuous membership from 1825 days before and 730 days after index date
age - age at index date
original_index_date - original index date (Same as index date except for cnr_data1-cnr_data12 in which case it's the diagnosis date).

2. sample with features at level of id-dx (by sample). Identical number of rows as above table but with 100's of additional features as described in the paper.
SQL table name: FeaturesExtraction

## SQL Datasets 3-6 contain all healthcare utilization (counts and cost) of specified samples over specified periods: 

3. utilization 1 year after (by sample). For each id-index date pair in tempTable_sampleIDs excluding cnr_data1-cnr_data12, 
					 extraction results of all utilization between index date and 1 yr after index date and associated cost. 
					 Each row corresponds to a health care action (lab test, hospital admission, etc.)

SQL table name: tempTable_rawdata
Columns:
id_var - ID Number
date_start - Date of beginning of period queried
date_end - Date of end of period queried
event_date - Date of beginning of event
event_date_end - Date of end of event
zihui_bikur - Identifier of event
source - Source table for event
category - Category of event
profession - Subcategory of event
amount - Number of events included in row
actual_cost - Actual cost paid out for event
actual_cost_memudad - Cost paid out for event in today's NIS
normalized_cost - Cost paid out for event normalized by Health Ministry's price list on event date
normalized_cost_today - Cost paid out for event normalized by Health Ministry's price list today


4. aggregated utilization 1 year after dx (by sample): 
	Aggregated version of tempTable_rawdata by id_var, date_start, event_date, category, profession.

SQL table name: FeaturesExtraction_3rdTable
Columns:
id_var - ID Number
S_index_date_XX_year - Year of index date
S_index_date_XX_month - Month of index date
S_index_date_XX_day - Day of index date
event_date_year - Year of event
event_date_month - Month of event
event_date_day - Day of event
event_date_days - Length of event in days
category - Category of event
profession - Subcategory of event
main_cat - Alternate category of event
any_null_in_cost_cols - Missing cost
amount - Number of events included in row
actual_cost - Actual cost paid out for event
actual_cost_memudad - Cost paid out for event in today's NIS
normalized_cost - Cost paid out for event normalized by Health Ministry's price list on event date
normalized_cost_today - Cost paid out for event normalized by Health Ministry's price list today


5. utilization 1 year before dx (by sample). For each id-index date pair in tempTable_sampleIDs, 
					 extraction results of all utilization between one year before index date and day before index date and associated cost. 
					 Each row corresponds to a health care action (lab test, hospital admission, etc.)

SQL table name: tempTable_rawdata_1yrBeforeIndexDate
Columns:
id_var - ID Number
date_start - Date of beginning of period queried
date_end - Date of end of period queried
event_date - Date of beginning of event
event_date_end - Date of end of event
zihui_bikur - Identifier of event
source - Source table for event
category - Category of event
profession - Subcategory of event
amount - Number of events included in row
actual_cost - Actual cost paid out for event
actual_cost_memudad - Cost paid out for event in today's NIS
normalized_cost - Cost paid out for event normalized by Health Ministry's price list on event date
normalized_cost_today - Cost paid out for event normalized by Health Ministry's price list today

6. utilization year 2 after dx (by sample). For each row in tempTable_sampleIDs excluding cnr_data1-cnr_data12, 
					 extraction results of all utilization between 366 days after index date and 2 yr after index date and associated cost. 
					 Each row is a health care action (lab test, hospital admission, etc.)

SQL table name: tempTable_rawdata_2nd_year
Columns:
id_var - ID Number
date_start - Date of beginning of period queried
date_end - Date of end of period queried
event_date - Date of beginning of event
event_date_end - Date of end of event
zihui_bikur - Identifier of event
source - Source table for event
category - Category of event
profession - Subcategory of event
amount - Number of events included in row
actual_cost - Actual cost paid out for event
actual_cost_memudad - Cost paid out for event in today's NIS
normalized_cost - Cost paid out for event normalized by Health Ministry's price list on event date
normalized_cost_today - Cost paid out for event normalized by Health Ministry's price list today
 
## SQL Dataset for analysis at the event level: 
 
7. events sample for prediction :  ER, imaging, hospitalizations, radiation, oncology drugs within year of cancer diagnosis date.

SQL table name: events_for_prediction 
Columns:
id_var - ID Number
S_index_date_XX - Index Date
category - Category of event
profession - Subcategory of event
event_date - Date of beginning of event
event_date_end - Date of end of event
 
8. revision cancer sample with events: Rows for both 12 monthly rows from diagnosis date and all events within year of diagnosis. Full set of 100's of features

SQL table name: FeaturesExtraction2 


9. final events with features: Rows for all events from events_for_prediction except imaging. Full set of 100's of features. 

SQL table name: FeaturesExtraction_events_2
